Train the Machine with What It Can Learn - Corpus Selection for SMT

نویسندگان

Xiwu Han

Hanzhang Li

Tiejun Zhao

چکیده

Statistical machine translation relies heavily on available parallel corpora, but SMT may not have the ability or intelligence to make full use of the training set. Instead of collecting more and more parallel training corpora, this paper aims to improve SMT performance by exploiting the full potential of existing parallel corpora. We first identify literally translated sentence pairs via lexical and grammatical compatibility, and then use these data to train SMT models. One experiment indicates that larger training corpora do not always lead to higher decoding performance when the added data are not literal translations. And another experiment shows that properly enlarging the contribution of literal translation can improve SMT performance significantly.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Translating from Complex to Simplified Sentences

We address the problem of simplifying Portuguese texts at the sentence level by treating it as a "translation task". We use the Statistical Machine Translation (SMT) framework to learn how to translate from complex to simplified sentences. Given a parallel corpus of original and simplified texts, aligned at the sentence level, we train a standard SMT system and evaluate the "translations" produ...

متن کامل

Linguistically Annotated BTG for Statistical Machine Translation

Bracketing Transduction Grammar (BTG) is a natural choice for effective integration of desired linguistic knowledge into statistical machine translation (SMT). In this paper, we propose a Linguistically Annotated BTG (LABTG) for SMT. It conveys linguistic knowledge of source-side syntax structures to BTG hierarchical structures through linguistic annotation. From the linguistically annotated da...

متن کامل

Domain Adaptation for Machine Translation with Instance Selection

Domain adaptation for machine translation (MT) can be achieved by selecting training instances close to the test set from a larger set of instances. We consider 7 different domain adaptation strategies and answer 7 research questions, which give us a recipe for domain adaptation in MT. We perform English to German statistical MT (SMT) experiments in a setting where test and training sentences c...

متن کامل

A Graph-based Bilingual Corpus Selection Approach for SMT

In statistical machine translation, the number of sentence pairs in the bilingual corpus is very important to the quality of translation. However, when the quantity reaches some extent, enlarging the corpus has less effect on the translation quality; whereas increasing greatly the time and space complexity to train the translation model, which hinders the development of statistical machine tran...

متن کامل

پیکره اعلام: یک پیکره استاندارد واحدهای اسمی برای زبان فارسی

Named entity recognition (NER) is a natural language processing (NLP) problem that is mainly used for text summarization, data mining, data retrieval, question and answering, machine translation, and document classification systems. A NER system is tasked with determining the border of each named entity, recognizing its type and classifying it into predefined categories. The categories of named...

متن کامل

ذخیره در منابع من

ذخیره در منابع من قبلا به منابع من ذحیره شده

{@ msg_add @}

با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره شماره

صفحات -

تاریخ انتشار 2009

Train the Machine with What It Can Learn - Corpus Selection for SMT

نویسندگان

چکیده

منابع مشابه

Translating from Complex to Simplified Sentences

Linguistically Annotated BTG for Statistical Machine Translation

Domain Adaptation for Machine Translation with Instance Selection

A Graph-based Bilingual Corpus Selection Approach for SMT

پیکره اعلام: یک پیکره استاندارد واحدهای اسمی برای زبان فارسی

عنوان ژورنال:

اشتراک گذاری